{"nbformat":4,"nbformat_minor":5,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.10"},"colab":{"name":"systematic-reviewpy tutorial.ipynb","provenance":[],"collapsed_sections":["bjNb91Ujh24e","TTbVIzuMVDsl","l1kE_nEkaPTh","5gqcBKW5YKWb","1dab0dbd","CquSYFX-hJM-","35b8ffdc","a7c2a30f","86d72e76","F7B8E5eghhHh","d9c52096"]}},"cells":[{"cell_type":"markdown","metadata":{"tags":[],"id":"b15a3e12"},"source":["# Quick Tutorial"],"id":"b15a3e12"},{"cell_type":"markdown","metadata":{"id":"bjNb91Ujh24e"},"source":["### Installation"],"id":"bjNb91Ujh24e"},{"cell_type":"markdown","metadata":{"id":"TTbVIzuMVDsl"},"source":["#### Required Dependencies"],"id":"TTbVIzuMVDsl"},{"cell_type":"code","metadata":{"id":"73336c13"},"source":["!pip install rispy\n","!pip install pandas\n","!pip install matplotlib\n","!pip install seaborn"],"id":"73336c13","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1q4bRlAxVYKC"},"source":["#### Installing systematic-reviewpy"],"id":"1q4bRlAxVYKC"},{"cell_type":"code","metadata":{"id":"gIHx_f6jsNzG"},"source":["!python3 -m pip install systematic-reviewpy"],"id":"gIHx_f6jsNzG","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ZAUimZui4S7u"},"source":["Google Colab Jupyter notebook instructions:\n","- `Ctrl m m` converts a code cell to a text cell.\n","- `Ctrl m y` converts a text cell to a code cell. 
"],"id":"ZAUimZui4S7u"},{"cell_type":"markdown","metadata":{"id":"0UexV0R9ZCUL"},"source":["##### Install pdftotext dependencies: these are needed by the Python PDF readers used for validation and for search counts of PDF text."],"id":"0UexV0R9ZCUL"},{"cell_type":"markdown","metadata":{"id":"mKtjHh8aV2L5"},"source":["Please run the cell for your OS and keep the other cells as markdown."],"id":"mKtjHh8aV2L5"},{"cell_type":"code","metadata":{"id":"oFSqx54JZTI8"},"source":["##### Debian, Ubuntu, and friends\n","!sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev"],"id":"oFSqx54JZTI8","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Fs7aPlANZy_O"},"source":["##### Fedora, Red Hat, and friends\n","!sudo yum install gcc-c++ pkgconfig poppler-cpp-devel python3-devel"],"id":"Fs7aPlANZy_O"},{"cell_type":"markdown","metadata":{"id":"fVGqb4gZZS6n"},"source":["##### macOS\n","!brew install pkg-config poppler python"],"id":"fVGqb4gZZS6n"},{"cell_type":"markdown","metadata":{"id":"uHPawm47a0RC"},"source":["##### Windows using conda\n","!conda install -c conda-forge poppler"],"id":"uHPawm47a0RC"},{"cell_type":"markdown","metadata":{"id":"l1kE_nEkaPTh"},"source":["#### Install Python PDF readers"],"id":"l1kE_nEkaPTh"},{"cell_type":"code","metadata":{"id":"2a11d4e7"},"source":["## https://pypi.org/project/PyMuPDF/\n","!python -m pip install --upgrade pip\n","!python -m pip install --upgrade pymupdf\n","## https://pypi.org/project/pdftotext/\n","!pip install pdftotext"],"id":"2a11d4e7","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"5gqcBKW5YKWb"},"source":["#### Importing systematic-reviewpy"],"id":"5gqcBKW5YKWb"},{"cell_type":"code","metadata":{"id":"b75c22a4"},"source":["import systematic_review"],"id":"b75c22a4","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vVYt0_1Vcg9d"},"source":["Most objects provide methods such as to_csv and to_excel to write output 
files"],"id":"vVYt0_1Vcg9d"},{"cell_type":"markdown","metadata":{"id":"xWz3svaVk7oD"},"source":["Check the documentation for more string manipulation methods:\n","- preprocess_string (the default, applied before all other implemented functions)\n","- custom_text_manipulation_function: use this to supply your own function to preprocess the text\n","- nltk_remove_stopwords\n","- pattern_lemma_or_lemmatize_text \n","- nltk_word_net_lemmatizer \n","- nltk_porter_stemmer\n","- nltk_lancaster_stemmer \n","- spacy_lemma \n","- nltk_remove_stopwords_spacy_lemma \n","- convert_string_to_lowercase\n","- preprocess_string_to_space_separated_words"],"id":"xWz3svaVk7oD"},{"cell_type":"markdown","metadata":{"id":"opwNm_pHm5Be"},"source":["Please provide the name of the string manipulation method."],"id":"opwNm_pHm5Be"},{"cell_type":"code","metadata":{"id":"da91fc84"},"source":["string_manipulation_method = 'convert_string_to_lowercase'"],"id":"da91fc84","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"jp-MarkdownHeadingCollapsed":true,"tags":[],"id":"1dab0dbd"},"source":["## Optional: Converting and wrangling citation files"],"id":"1dab0dbd"},{"cell_type":"markdown","metadata":{"id":"79df2b57"},"source":["Wrangling or modifying the citation files is required if a format error occurs while uploading files into the reference manager."],"id":"79df2b57"},{"cell_type":"code","metadata":{"id":"75ec5c5b"},"source":["#citation.csv_citations_to_ris_converter(\"./Data files and Python Code/Downloaded files/springer.csv\", \"./Data files and Python Code/Modified files/springer.ris\")"],"id":"75ec5c5b","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"97da36da"},"source":["#citation.remove_empty_lines(\"./Data files and Python Code/Downloaded files/entropy-v12-i12_20210610.ris\", \"./Data files and Python Code/Modified 
files/MDPI.ris\")"],"id":"97da36da","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"974f2b51"},"source":["#citation.edit_ris_citation_paste_values_after_regex_pattern(\"./Data files and Python Code/Modified files/MDPI.ris\", \"./Data files and Python Code/Modified files/mdpi.ris\")"],"id":"974f2b51","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"a09eaf2e"},"source":["#import os\n","#os.remove(\"./Data files and Python Code/Modified files/MDPI.ris\")"],"id":"a09eaf2e","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"CquSYFX-hJM-"},"source":["## Citations"],"id":"CquSYFX-hJM-"},{"cell_type":"markdown","metadata":{"tags":[],"id":"35b8ffdc"},"source":["### All files are uploaded to the Mendeley reference manager, updated using the Mendeley database, and downloaded in RIS format."],"id":"35b8ffdc"},{"cell_type":"markdown","metadata":{"id":"d3cb8a5b"},"source":["Please provide the path of the folder that contains all citation RIS files."],"id":"d3cb8a5b"},{"cell_type":"code","metadata":{"tags":[],"id":"b0bdbb00"},"source":["CITATIONS_FILES_PARENT_DIR_PATH = \"./Data files and Python Code/Articles_by_sources\""],"id":"b0bdbb00","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"31116f59"},"source":["citations = systematic_review.citation.Citations(CITATIONS_FILES_PARENT_DIR_PATH)"],"id":"31116f59","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"c4b9d88f"},"source":["citations_df = citations.get_dataframe()\n","citations_df"],"id":"c4b9d88f","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"a7c2a30f"},"source":["### Search Words"],"id":"a7c2a30f"},{"cell_type":"markdown","metadata":{"id":"d4fd5a18"},"source":["Please provide the path of search_words.json or create a keyword 
dictionary."],"id":"d4fd5a18"},{"cell_type":"code","metadata":{"id":"fde647c5"},"source":["systematic_review.search_count.SearchWords().get_sample_keywords_json()"],"id":"fde647c5","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"M2u7UrIGgA1K"},"source":["Edit the template to suit your needs and provide the file path in the cell below. If the filename and location are unchanged, no change is needed."],"id":"M2u7UrIGgA1K"},{"cell_type":"code","metadata":{"id":"9450ee40"},"source":["#KEYWORDS_JSON_FILE_PATH = \"./Data files and Python Code/keywords.json\"\n","SEARCH_WORDS_JSON_FILE_PATH = \"./sample_search_words_template.json\""],"id":"9450ee40","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"adb320d6"},"source":["search_words = systematic_review.search_count.SearchWords(SEARCH_WORDS_JSON_FILE_PATH, string_manipulation_method)"],"id":"adb320d6","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"2a6e4624"},"source":["print(search_words.value)"],"id":"2a6e4624","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"tags":[],"id":"86d72e76"},"source":["### Search and count words in citations"],"id":"86d72e76"},{"cell_type":"code","metadata":{"id":"434ab5ef"},"source":["citations_search_words_count = systematic_review.search_count.SearchCount(citations_df, search_words, string_manipulation_method)"],"id":"434ab5ef","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"scrolled":true,"id":"d05dbb3c"},"source":["citations_search_words_count_df = citations_search_words_count.get_dataframe()\n","citations_search_words_count_df"],"id":"d05dbb3c","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"d2bc32b3"},"source":["citations_search_words_count.to_csv(\"./Data files and Python Code/OutputFiles/citations_keywords_count_df.csv\")"],"id":"d2bc32b3"},{"cell_type":"markdown","metadata":{"tags":[],"id":"dd98d0af"},"source":["### Sort and 
Filter the citations"],"id":"dd98d0af"},{"cell_type":"markdown","metadata":{"id":"056a695b"},"source":["Please provide how many research papers are needed."],"id":"056a695b"},{"cell_type":"code","metadata":{"id":"b6dc5641"},"source":["# Filter the citations to the required number\n","required_citations_number = 500"],"id":"b6dc5641","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"8c4444a5"},"source":["filter_sorted_citations = systematic_review.filter_sort.FilterSort(citations_search_words_count_df, search_words, required_citations_number)"],"id":"8c4444a5","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"b1196411"},"source":["filter_sorted_citations_df = filter_sorted_citations.get_dataframe()"],"id":"b1196411","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"95a2c156"},"source":["print(len(filter_sorted_citations_df))"],"id":"95a2c156","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"713f0e4f"},"source":["filter_sorted_citations.to_csv(\"./Data files and Python Code/OutputFiles/filter_sorted_citations_df.csv\")"],"id":"713f0e4f"},{"cell_type":"markdown","metadata":{"id":"F7B8E5eghhHh"},"source":["## Research paper files"],"id":"F7B8E5eghhHh"},{"cell_type":"markdown","metadata":{"tags":[],"id":"0ec85414"},"source":["### Downloading the PDFs selected above from databases."],"id":"0ec85414"},{"cell_type":"markdown","metadata":{"id":"c2dca8b2"},"source":["This step is completed with [browser-automationpy](https://github.com/chandraveshchaudhari/browser-automationpy)."],"id":"c2dca8b2"},{"cell_type":"markdown","metadata":{"tags":[],"id":"918a472b"},"source":["### Validating the downloaded articles"],"id":"918a472b"},{"cell_type":"markdown","metadata":{"id":"Yb4frDf8hAId"},"source":["Please provide the parent directory path of all downloaded research papers."],"id":"Yb4frDf8hAId"},{"cell_type":"code","metadata":{"id":"bd81e852"},"source":["DOWNLOADED_ARTICLES_PATH = \"./Data files and 
Python Code/downloadedArticles\""],"id":"bd81e852","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"P04lsOaBivck"},"source":["Please provide the path of a text file containing the names of inaccessible research papers, one per line, or write None."],"id":"P04lsOaBivck"},{"cell_type":"code","metadata":{"id":"YKqz5YMqkGuL"},"source":["IN_ACCESSIBLE_ARTICLES_TEXT_FILE_PATH = \"./Data files and Python Code/not_accessible_articles.txt\""],"id":"YKqz5YMqkGuL","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"5e7e90ce"},"source":["validation = systematic_review.validation.Validation(filter_sorted_citations_df, DOWNLOADED_ARTICLES_PATH, IN_ACCESSIBLE_ARTICLES_TEXT_FILE_PATH)"],"id":"5e7e90ce","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"scrolled":true,"id":"e27cab5c"},"source":["validated_research_papers = validation.get_dataframe()"],"id":"e27cab5c","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"f90ab757"},"source":["validation.info()"],"id":"f90ab757","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"1a2ea0e1"},"source":["validation.to_csv(\"validation.csv\")"],"id":"1a2ea0e1"},{"cell_type":"markdown","metadata":{"tags":[],"id":"8ff34466"},"source":["### Search and count words in the research paper files"],"id":"8ff34466"},{"cell_type":"code","metadata":{"id":"733f9288"},"source":["research_paper_search_words_count = systematic_review.search_count.SearchCount(validated_research_papers, search_words, string_manipulation_method)"],"id":"733f9288","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"747141bc"},"source":["research_paper_search_words_count_df = research_paper_search_words_count.get_dataframe()"],"id":"747141bc","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"24640637"},"source":["research_paper_search_words_count.to_csv(\"./Data files and Python 
Code/OutputFiles/pdf_keywords_count_df.csv\")"],"id":"24640637"},{"cell_type":"markdown","metadata":{"tags":[],"id":"23874447"},"source":["### Filter and sort the PDF search count dataframe"],"id":"23874447"},{"cell_type":"markdown","metadata":{"id":"Rp--8LkLmSCl"},"source":["Please provide how many research papers are needed."],"id":"Rp--8LkLmSCl"},{"cell_type":"code","metadata":{"id":"86b16847"},"source":["required_full_text_documents = 100"],"id":"86b16847","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"0aed2892"},"source":["filter_sorted_research_papers = systematic_review.filter_sort.FilterSort(research_paper_search_words_count_df, search_words, required_full_text_documents)"],"id":"0aed2892","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"35595644"},"source":["selected_review_articles_df = filter_sorted_research_papers.get_dataframe()"],"id":"35595644","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"e887c46c"},"source":["filter_sorted_research_papers.to_csv(\"./Data files and Python Code/OutputFiles/selected_review_articles_df.csv\")"],"id":"e887c46c"},{"cell_type":"markdown","metadata":{"id":"23IlaaDAodmF"},"source":["## Generating research paper review files\n","Choose any of the following:"],"id":"23IlaaDAodmF"},{"cell_type":"markdown","metadata":{"id":"BEZ9n6KYqW6p"},"source":["- Sorted by source: makes it easier to find articles in the folder."],"id":"BEZ9n6KYqW6p"},{"cell_type":"code","metadata":{"id":"134e8b93"},"source":["sorted_Finaldf = systematic_review.filter_sort.sort_dataframe_based_on_column(selected_review_articles_df, 'source')"],"id":"134e8b93","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"fe11ee3f"},"source":["#sorted_Finaldf.to_csv(\"./Data files and Python Code/OutputFiles/sorted_Finaldf.csv\")"],"id":"fe11ee3f","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"ea8e6182"},"source":["- Creating the sample 
literature review file: \n","by adding review columns for entering details manually. The keyword counts are not required at this point, so they are dropped."],"id":"ea8e6182"},{"cell_type":"code","metadata":{"id":"e1e84dc8"},"source":["selected_citation = systematic_review.citation.drop_search_words_count_columns(sorted_Finaldf, search_words)\n","selected_citation_review = systematic_review.analysis.creating_sample_review_file(selected_citation)"],"id":"e1e84dc8","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"19df5af9"},"source":["selected_citation_review.to_csv(\"./Data files and Python Code/OutputFiles/selected_citation_review.csv\")"],"id":"19df5af9"},{"cell_type":"markdown","metadata":{"id":"d9c52096"},"source":["## Systematic Review Workflow diagram and info"],"id":"d9c52096"},{"cell_type":"code","metadata":{"id":"39d6daf3"},"source":["my_analysis = systematic_review.analysis.SystematicReviewInfo(CITATIONS_FILES_PARENT_DIR_PATH, filter_sorted_citations_df,\n"," validated_research_papers, selected_review_articles_df)"],"id":"39d6daf3","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"8fa28712"},"source":["my_analysis.info()"],"id":"8fa28712","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"fd9f34ab"},"source":["my_analysis.systematic_review_diagram()"],"id":"fd9f34ab","execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"gZK7stGLruKj"},"source":["## Analysis"],"id":"gZK7stGLruKj"},{"cell_type":"markdown","metadata":{"id":"94f807ef"},"source":["| Analysis needed | Fact table | Diagram |\n","| --------------------------------------------------- | ---------- | ------- |\n","| The number of articles | yes | no |\n","| Period of the publications | yes | yes |\n","| Number of authors | yes | no |\n","| Articles with a single author | yes | no |\n","| Articles per author | yes | no |\n","| Authors per article | yes | no |\n","| Top N countries with the 
highest number of articles | yes | yes |\n","| Top N journals with the highest number of articles | yes | yes |\n","| Top N keywords most used in the articles | yes | yes |\n","| The year with the highest number of articles | yes | yes |"],"id":"94f807ef"},{"cell_type":"code","metadata":{"id":"e4ce96b5"},"source":["my_cite_analysis = systematic_review.analysis.CitationAnalysis(sorted_Finaldf)"],"id":"e4ce96b5","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ae38babc"},"source":["my_cite_analysis.publication_year_info()"],"id":"ae38babc","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"be3a2289"},"source":["my_cite_analysis.publication_year_diagram()"],"id":"be3a2289","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"be36d03a"},"source":["my_cite_analysis.authors_info()"],"id":"be36d03a","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"73d4ffb4"},"source":["my_cite_analysis.publication_place_info()"],"id":"73d4ffb4","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"d8ff6959"},"source":["my_cite_analysis.publication_place_diagram()"],"id":"d8ff6959","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"scrolled":true,"id":"f723d147"},"source":["my_cite_analysis.keywords_info()"],"id":"f723d147","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"scrolled":true,"id":"442ad526"},"source":["my_cite_analysis.keyword_diagram(top_result=10)"],"id":"442ad526","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"89059183"},"source":["my_cite_analysis.publisher_info()"],"id":"89059183","execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"111bafd8"},"source":["my_cite_analysis.publisher_diagram()"],"id":"111bafd8","execution_count":null,"outputs":[]}]}